Read Graphical Data Analysis with R, Ch. 6, 7

Grading is based both on your graphs and verbal explanations. All graphs must have informative titles and follow all best practices as discussed in class.

Labels do not have to perfect but they have to be legible. Often it is helpful to shorten or abbreviate labels: this can be done before plotting or within the plot functions. Be sure to include all adjustments in your scripts.

1. penquins

We will use the penguins_raw dataset from the palmerpenguins package.

  1. Draw a parallel coordinates plot of the numeric columns in the dataset using ggparcoord from the GGally package. Choose parameters to help identify trends. What do you observe?
library(GGally)
library(palmerpenguins)
data <- Filter(is.numeric, penguins_raw)
ggparcoord(data, columns = 1 : 7, alphaLines =.3, scale ="uniminmax") + geom_vline(xintercept =1:7, color ="lightblue")

Sample number and culmen length, Body Mass (g) and Delta 15 N (o/oo), tend to have negative correlation. Flipper length is either high or low not much in the middle at .50. Average culmen legnth is low with some outliers having high culmen length.

  1. Experiment with using color to separate factor levels of the factor variables (one at a time, not in the same graph). Include the plot which shows the most distinct clusters. Briefly describe how the clusters differ from each other. (The factor variable should not be included as an axis in the parallel coordinates plot.)
#groupby Island
ggparcoord(penguins_raw, columns = c(2, 10:13, 15:16), groupColumn = 5, alphaLines =.3, scale ="uniminmax")

Penguins from Biscoe island tend to have higher Culum length and lower culum depth & Delta 15N compared with other Island. Penguins from Dream and Togersen always have similar characteristics.

2. pulitzer

  1. Draw an interactive parallel coordinates plot of the pulitzer dataset in the fivethirtyeight package with brushing using the parcoords() function in the parcoords package.
library(dplyr)
library(fivethirtyeight)
library(parcoords)
pulitzer %>% arrange(newspaper) %>%
  parcoords(
    rownames = F, 
    brushMode = "1D-axes", 
    reorderable = T, 
    queue = T)
  1. Which newspapers appear to be multivariate outliers? Briefly describe their unusal patterns.

New York Times, USA Today, Los Angeles Times, and Wall Street Journal. These journals have a high number of pulitzer prizes in all years compared to the other journals. They also have high daily circulation in both 2004 and 2013 than most other journals.

  1. Choose one of the newspapers and research it online to gain a deeper understanding of its uniqueness. Provide a brief summary of your results. Cite your sources by linking to the pages where you found the information.

3. doctor visits

For this problem we will use the DoctorVisits data set in the AER package. (Note: for this package you need data("DoctorVisits") to load the data.)

  1. We will consider visits to be the dependent variable. Create a new factor column, numvisits, based on visits, that contains three levels: “0”, “1”, and “2+”. (Tidyverse option: forcats::fct_collapse()). Why is this a reasonable choice?
library('AER')
library(forcats)
data("DoctorVisits")
DoctorVisits$visits <- as.factor(DoctorVisits$visits)
DoctorVisits <- DoctorVisits %>%
  mutate(numvisits = fct_collapse(visits,
    '0' = c("0"),
    '1' = c("1"),
    '2+' = c("2", "3", "4", "5", "6", "7", "8", "9")
  ))
head(DoctorVisits)
##   visits gender  age income illness reduced health private freepoor freerepat
## 1      1 female 0.19   0.55       1       4      1     yes       no        no
## 2      1 female 0.19   0.45       1       2      1     yes       no        no
## 3      1   male 0.19   0.90       3       0      0      no       no        no
## 4      1   male 0.19   0.15       1       0      0      no       no        no
## 5      1   male 0.19   0.45       2       5      1      no       no        no
## 6      1 female 0.19   0.35       5       1      9      no       no        no
##   nchronic lchronic numvisits
## 1       no       no         1
## 2       no       no         1
## 3       no       no         1
## 4       no       no         1
## 5      yes       no         1
## 6      yes       no         1

Because people who visit more than or equals to 2 means he/she is a re-visit patient so they can be allocated to one category. Those who visited once may just for some specific reason.

  1. Draw a mosaic pairs plot of all factor variables in DoctorVisits. Based on the plot, which variables appear to have a strong association with numvisits? medium association? weak or no association? (Make your judgments relative to the other variables.)
count <- DoctorVisits %>%
  group_by(gender, private, freepoor, freerepat, nchronic, lchronic, numvisits) %>%
  summarise(Freq = n())
head(count)
## # A tibble: 6 x 8
## # Groups:   gender, private, freepoor, freerepat, nchronic, lchronic [2]
##   gender private freepoor freerepat nchronic lchronic numvisits  Freq
##   <fct>  <fct>   <fct>    <fct>     <fct>    <fct>    <fct>     <int>
## 1 male   no      no       no        no       no       0           624
## 2 male   no      no       no        no       no       1            51
## 3 male   no      no       no        no       no       2+           23
## 4 male   no      no       no        no       yes      0            61
## 5 male   no      no       no        no       yes      1            11
## 6 male   no      no       no        no       yes      2+            3
factor_summary <- DoctorVisits %>% group_by(gender, private, freepoor, freerepat, nchronic, lchronic, numvisits) %>% tally()
pairs(xtabs(n ~ ., factor_summary))

  1. Are p-values from \(\chi^2\) (chi-squared) tests of each of the variables and numvisits consistent with your categorization in part (b)? Explain briefly.
X1 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$age)
X2 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$income)
X3 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$illness)
X4 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$reduced)
X5 <- chisq.test(DoctorVisits$numvisits, DoctorVisits$health)
X1
## 
##  Pearson's Chi-squared test
## 
## data:  DoctorVisits$numvisits and DoctorVisits$age
## X-squared = 154.76, df = 22, p-value < 2.2e-16
X2
## 
##  Pearson's Chi-squared test
## 
## data:  DoctorVisits$numvisits and DoctorVisits$income
## X-squared = 97.044, df = 26, p-value = 3.961e-10
X3
## 
##  Pearson's Chi-squared test
## 
## data:  DoctorVisits$numvisits and DoctorVisits$illness
## X-squared = 389.35, df = 10, p-value < 2.2e-16
X4
## 
##  Pearson's Chi-squared test
## 
## data:  DoctorVisits$numvisits and DoctorVisits$reduced
## X-squared = 965.2, df = 28, p-value < 2.2e-16
X5
## 
##  Pearson's Chi-squared test
## 
## data:  DoctorVisits$numvisits and DoctorVisits$health
## X-squared = 244.34, df = 24, p-value < 2.2e-16
  1. Draw mosaic plots of gender, lchronic, private and numvisits in stages:
vcd::mosaic(numvisits ~ gender,
       direction = c("v", "h"), highlighting_fill = c("grey90", "cornflowerblue"), count)

vcd::mosaic(numvisits ~ gender + lchronic,
       direction = c("v", "v", "h"), highlighting_fill = c("red", "cornflowerblue"), count)

vcd::mosaic(numvisits ~ gender + lchronic + private,
       direction = c("v", "v", "v", "h"), highlighting_fill = c("grey90", "cornflowerblue"), count)

All cuts should be vertical except the last one. The last cut should be the dependent variable. Use appropriate fill colors. What patterns do you observe?

Gender, provate, and Ichronic may not be a good feature for split.

  1. Use geom_alluvium() from the ggalluvial package to display the same variables as in the last graph of part (d). What new information / perspective does the alluvial plot provide, if any?
library(ggalluvial)
ggplot(as.data.frame(count), aes(y = Freq, axis1 = lchronic, axis2 = private)) +   
  geom_alluvium(aes(fill = lchronic), width = 1/12)

Those who is Ichronic and those who not have totally different patterns.

4. likert data

  1. Find data online with at least three categorical variables whose levels form the same spectrum from one pole to its opposite. A common example is likert survey responses with categories such as “Strongly agree”, “Agree”, “Neutral”, “Disagree”, “Strongly disagree”, but your choice does not need to involve likert data. Provide a link to your data source.
data <- read.csv('https://raw.githubusercontent.com/nychealth/coronavirus-data/master/by-age.csv')
  1. If possible read your data directly from the site. If not (for example if it is in pdf form), create a data frame in your code, or read it in from a data file that you’ve created. Include the file with your submission. Display a table of your data in summarized form.
head(data)
##   AGE_GROUP CASE_RATE HOSPITALIZED_RATE DEATH_RATE CASE_COUNT
## 1      0-17    575.53             40.42       0.87       9939
## 2     18-44   2878.40            294.95      22.23      96993
## 3     45-64   4190.32            931.69     210.21      86155
## 4     65-74   4113.90           1807.99     681.57      28761
## 5       75+   4864.57           2916.95    1725.86      26608
## 6  Citywide   2964.07            695.22     229.70     248945
##   HOSPITALIZED_COUNT DEATH_COUNT
## 1                698          15
## 2               9939         749
## 3              19156        4322
## 4              12640        4765
## 5              15955        9440
## 6              58390       19292
  1. Draw a diverging stacked bar chart of your data, following all best practices. You may use any package.
HH::likert(AGE_GROUP~., data, positive.order = TRUE,
           main = 'NYC Covid-19 test result by age',
           xlab = 'percent', ylab = 'age group')